Introduction

The primary goal of this work is to explore and forecast rainfall in Bangalore Rural district. The secondary goal is to compare the forecast of a simple SARIMA model with that of Prophet.

Explore data

In [1]:
from IPython.display import display, Markdown
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.offline as po
import plotly.graph_objects as go
from prophet import Prophet
from scipy import stats
from sklearn.metrics import mean_absolute_error, root_mean_squared_error
from sklearn.model_selection import train_test_split
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller

po.init_notebook_mode()

rain_data = pd.read_csv('daily-rainfall-data-district-level.csv', parse_dates=["date"])
In [2]:
# Sanity check
rain_data.describe()
Out[2]:
|       | id           | date                          | state_code   | district_code | actual       | rfs          | normal       | deviation     |
|-------|--------------|-------------------------------|--------------|---------------|--------------|--------------|--------------|---------------|
| count | 3.577399e+06 | 3577399                       | 3.577399e+06 | 3.577399e+06  | 3.437512e+06 | 3.535587e+06 | 2.911744e+06 | 2.431767e+06  |
| mean  | 1.788699e+06 | 2016-10-15 22:55:50.544908032 | 1.774312e+01 | 3.364497e+02  | 3.401141e+00 | 5.302644e-01 | 3.569884e+00 | 3.624293e+01  |
| min   | 0.000000e+00 | 2009-01-01 00:00:00           | 1.000000e+00 | 1.000000e+00  | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -1.000000e+02 |
| 25%   | 8.943495e+05 | 2012-11-23 00:00:00           | 9.000000e+00 | 1.630000e+02  | 0.000000e+00 | 0.000000e+00 | 2.000000e-01 | -1.000000e+02 |
| 50%   | 1.788699e+06 | 2016-10-16 00:00:00           | 1.900000e+01 | 3.220000e+02  | 0.000000e+00 | 0.000000e+00 | 1.100000e+00 | -1.000000e+02 |
| 75%   | 2.683048e+06 | 2020-09-08 00:00:00           | 2.400000e+01 | 4.790000e+02  | 1.630000e+00 | 1.758328e-01 | 5.200000e+00 | -4.588500e+01 |
| max   | 3.577398e+06 | 2024-07-31 00:00:00           | 3.800000e+01 | 9.999000e+03  | 4.929900e+02 | 2.806424e+02 | 1.548000e+02 | 1.745600e+05  |
| std   | 1.032706e+06 | NaN                           | 9.556991e+00 | 4.266907e+02  | 1.011035e+01 | 2.031809e+00 | 5.599918e+00 | 8.467325e+02  |

Here is a brief description of these columns:

  • id: Row identifier
  • state_code and district_code: Codes assigned by the India Meteorological Department
  • actual: Recorded rainfall (in mm)
  • rfs: Rainfall storage (in mm)
  • normal: Expected rainfall (in mm)
  • deviation: Deviation of rfs from normal

Out of these columns, we are interested in the "actual" rainfall for the forecast.

In [3]:
# Missing data analysis
non_missing_actual = rain_data['actual'].count()  # count() excludes NaN rows
missing_actual = rain_data['actual'].isnull().sum()
percent_missing = missing_actual / non_missing_actual  # relative to non-missing rows

display(Markdown(f"### Missing rows: {missing_actual}/{non_missing_actual} . Percentage: {percent_missing:.2%}"))

Missing rows: 139887/3437512 . Percentage: 4.07%

We will need to replace these missing values when cleaning the data.

In [4]:
# Distribution analysis
fig = px.histogram(rain_data, x='actual', nbins=30, title="Distribution of Actual Rainfall")
fig.update_layout(xaxis_title="Actual Rainfall (mm)", yaxis_title="Frequency")
fig.show()

There appear to be some extreme outliers, which we need to remove for a sensible prediction model. Moreover, the data is heavily skewed toward 0, so a Box-Cox transformation should help fit a decent model.
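As a sketch of the Box-Cox transform we will use (the +1 shift handles zero-rainfall days; the values here are invented, not from the dataset):

```python
import numpy as np
from scipy import stats

rain = np.array([0.0, 0.5, 2.0, 10.0, 45.0])  # toy rainfall values, including a zero

# Shift by +1 so Box-Cox sees strictly positive data; lambda is estimated by MLE
transformed, lam = stats.boxcox(rain + 1)

# The inverse transform recovers the original scale
recovered = (transformed * lam + 1) ** (1 / lam) - 1
print(np.allclose(recovered, rain))  # True
```

The same inverse formula appears later when we map SARIMA and Prophet forecasts back to millimetres.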

Clean data

  • Since we are interested only in the actual rainfall, drop all other columns.
  • Impute missing values by linear interpolation, since rainfall between two rainy days is likely to be at least slightly above zero.
  • Handle outliers using the interquartile range (IQR): remove data points more than 1.5 × IQR below Q1 or above Q3.
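A quick sketch of the linear interpolation used for imputation (a toy series, not the actual dataset):

```python
import numpy as np
import pandas as pd

# Toy daily rainfall with two missing readings
s = pd.Series([0.0, np.nan, np.nan, 3.0, 1.0])

# Linear interpolation fills each gap proportionally between its neighbours
filled = s.interpolate(method="linear")
print(filled.tolist())  # [0.0, 1.0, 2.0, 3.0, 1.0]
```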
In [5]:
def clean_data(df):
    df = df.drop(columns=['id','state_code','district_code','rfs','deviation','normal'])
    df = df.sort_values(by='date')
    # Fill missing values by interpolation
    df['actual'] = df['actual'].interpolate(method='linear')
    df['actual'] = df['actual'].clip(lower=0)

    # Using IQR for outlier detection
    Q1 = df['actual'].quantile(0.25)
    Q3 = df['actual'].quantile(0.75)
    IQR = Q3 - Q1

    # Define outlier range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Filter data
    df = df[(df['actual'] >= lower_bound) & (df['actual'] <= upper_bound)]

    df = df.dropna()

    # Set date as index
    df = df.set_index('date')
    return df

rain_data = clean_data(rain_data.copy())

Filter district

In [6]:
# Select desired district
district = "Bengaluru Rural"
district_rain = rain_data[rain_data['district_name']==district].copy()
In [7]:
fig = px.bar(district_rain, y="actual", title=f"Daily Rainfall Over Time in {district}")
# Update axis titles
fig.update_layout(
    xaxis_title="Date of Observation",  # Custom x-axis label
    yaxis_title="Rainfall (mm)"  # Custom y-axis label
)

fig.show()

Predict with SARIMA model

ARIMA has three components:

  • AutoRegression (p): models the dependence of an observation on its lagged values
  • Differencing (d): differences the series to make it stationary (constant mean and variance)
  • Moving Average (q): models the dependence of an observation on past residual (white-noise) errors

Note that ARIMA assumes a stationary time series. To check stationarity, we can use the Augmented Dickey-Fuller (ADF) test.

SARIMA adds four seasonal parameters to the basic ARIMA model for time series with seasonal effects:

  • P (Seasonal AR): auto-regressive terms for the seasonal component.
  • D (Seasonal Differencing): the number of seasonal differences needed to make the series stationary with respect to seasonality.
  • Q (Seasonal MA): moving-average terms for the seasonal component.
  • m: the length of the seasonal cycle.

To build a simple SARIMA model for rainfall,

  • Yearly seasonality: m = 12
  • Rainfall is assumed non-stationary, so first differencing is applied: d = 1
  • Commonly used values for the remaining parameters: p = q = Q = 1, P = 0
In [8]:
district_data = district_rain.reset_index()[['date', 'actual']]
# Augmented Dickey-Fuller test for stationarity
result = adfuller(district_data['actual'].dropna())
display(Markdown(f"### ADF Statistic: {result[0]:.2} with p-value: {result[1]:.2}"))

ADF Statistic: -7.7 with p-value: 1.8e-11

The p-value is far below 0.05, so we reject the unit-root null hypothesis: the district series itself already looks stationary.

In [9]:
district_data['transformed'], lam = stats.boxcox(district_data['actual'] + 1)
print(f"Lambda for Box Cox: {lam}")

# Sort the data since date is no longer the index
district_data.sort_values(['date'],inplace=True)
train, test = train_test_split(district_data, test_size=0.25, shuffle=False)

# Display train-test split
display(Markdown(f"### Training: {len(train)} rows, Testing: {len(test)} rows"))

# SARIMA parameters
p,d,q = 1, 1, 1
P,D,Q,m = 0, 1, 1, 12
# Fit SARIMA on training data
model = SARIMAX(train['transformed'], 
                order=(p, d, q), 
                seasonal_order=(P, D, Q, m))
model_fit = model.fit(disp=False)

# Summary of the model
print(model_fit.summary())
Lambda for Box Cox: -3.7510736795445982

Training: 3585 rows, Testing: 1196 rows

c:\Users\nmanj\.pyenv\pyenv-win\versions\3.12.5\Lib\site-packages\statsmodels\base\model.py:607: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                        transformed   No. Observations:                 3585
Model:             SARIMAX(1, 1, 1)x(0, 1, 1, 12)   Log Likelihood                3587.755
Date:                            Thu, 12 Dec 2024   AIC                          -7167.509
Time:                                    21:40:21   BIC                          -7142.786
Sample:                                         0   HQIC                         -7158.695
                                           - 3585                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.2494      0.015     16.515      0.000       0.220       0.279
ma.L1         -0.9281      0.007   -142.311      0.000      -0.941      -0.915
ma.S.L12      -0.9999      0.189     -5.280      0.000      -1.371      -0.629
sigma2         0.0077      0.001      5.281      0.000       0.005       0.011
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):               180.22
Prob(Q):                              0.93   Prob(JB):                         0.00
Heteroskedasticity (H):               0.97   Skew:                             0.53
Prob(H) (two-sided):                  0.61   Kurtosis:                         3.27
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [10]:
# Forecast for the test period
forecast = model_fit.forecast(steps=len(test))
forecast_original_scale = ((forecast * lam + 1) ** (1 / lam)) - 1
test['forecast'] = forecast_original_scale.values
# Set missing values in 'forecast' to 0
test['forecast'] = test['forecast'].fillna(0)

# Plot the forecast 
fig = go.Figure()

# Add forecast data
fig.add_trace(go.Scatter(
    x=test['date'], y=test['forecast'],
    mode='markers',
    name='Forecast',
    marker=dict(color='lightblue')
))

# Add test data to the plot
fig.add_trace(go.Scatter(
    x=test['date'],
    y=test['actual'],
    mode='markers',
    name='Test Data',
    marker=dict(color='gray')
))

fig.update_layout(
    xaxis_title="Date",
    yaxis_title="Rainfall (mm)"
)

fig.show()

Evaluate SARIMA predictions

In [11]:
# Evaluate
mae = mean_absolute_error(test['actual'], test['forecast'])
rmse = root_mean_squared_error(test['actual'], test['forecast'])

mad_sarima = np.abs(test['actual'] - test['forecast'])  # per-point absolute errors, used in the comparison plot

display(Markdown(f"### MAE: {mae:.2} , RMSE: {rmse:.2}"))

MAE: 0.9 , RMSE: 1.1
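For reference, the two metrics reduce to simple formulas. A toy check (RMSE is computed by hand here; it matches sklearn's root_mean_squared_error used above):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([0.0, 2.0, 1.0, 4.0])
pred = np.array([1.0, 2.0, 3.0, 2.0])

errors = actual - pred
mae = mean_absolute_error(actual, pred)  # mean of |errors|
rmse = np.sqrt(np.mean(errors ** 2))     # square root of the mean squared error

print(mae, rmse)  # 1.25 1.5
```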

Predict with Prophet

In [12]:
# Prepare data for Prophet
district_df = district_rain.reset_index()[['date', 'actual']]

# Box Cox
district_df['transformed'], lam = stats.boxcox(district_df['actual'] + 1)
print(f"Lambda for Box Cox: {lam}")

district_df.drop('actual', axis=1, inplace=True)

district_df.rename(columns={'date': 'ds', 'transformed':'y'},inplace=True)
# Sort the data since date is no longer the index
district_df.sort_values(['ds'],inplace=True)
train, test = train_test_split(district_df, test_size=0.25, shuffle=False)
# Display train-test split
print(f"Training set: {len(train)} rows, Testing set: {len(test)} rows")
Lambda for Box Cox: -3.7510736795445982
Training set: 3585 rows, Testing set: 1196 rows
In [13]:
test.tail()
Out[13]:
|      | ds         | y   |
|------|------------|-----|
| 4776 | 2024-07-27 | 0.0 |
| 4777 | 2024-07-28 | 0.0 |
| 4778 | 2024-07-29 | 0.0 |
| 4779 | 2024-07-30 | 0.0 |
| 4780 | 2024-07-31 | 0.0 |
In [14]:
# Train
model = Prophet()
model.fit(train)
21:40:21 - cmdstanpy - INFO - Chain [1] start processing
21:40:21 - cmdstanpy - INFO - Chain [1] done processing
Out[14]:
<prophet.forecaster.Prophet at 0x2457838a240>
In [15]:
# future df
future = test[['ds']].copy()  # Use test dates as the future DataFrame
print(future.tail())  # Check the future dates
             ds
4776 2024-07-27
4777 2024-07-28
4778 2024-07-29
4779 2024-07-30
4780 2024-07-31
In [16]:
forecast = model.predict(future)

def inverse_boxcox(y_transformed, lam):
    return ((y_transformed * lam) + 1) ** (1 / lam) - 1

for col in forecast.columns:
    if col != 'ds':
        forecast[col] = inverse_boxcox(forecast[col], lam)

# Apply a lower bound to the yhat values
forecast['yhat'] = forecast['yhat'].clip(lower=0)
forecast['yhat_lower'] = forecast['yhat_lower'].clip(lower=0)

print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())  # Display predictions
             ds      yhat  yhat_lower  yhat_upper
1191 2024-07-27  0.272219    0.054225         NaN
1192 2024-07-28  0.239008    0.032293    1.264361
1193 2024-07-29  0.246546    0.040003    1.932589
1194 2024-07-30  0.263942    0.046425         NaN
1195 2024-07-31  0.257926    0.039977    4.100433

Evaluate Prophet predictions

In [17]:
# Merge predictions with test set
results = pd.concat([test.set_index('ds'), forecast[['ds', 'yhat']].set_index('ds')], axis=1).reset_index()
# Evaluate (note: 'y' is still on the Box-Cox scale here, while 'yhat' was inverse-transformed above)
mae = mean_absolute_error(results['y'], results['yhat'])
rmse = root_mean_squared_error(results['y'], results['yhat'])

mad_prophet = np.abs(results['y'] - results['yhat'])

display(Markdown(f"### MAE: {mae:.2} , RMSE: {rmse:.2}"))

MAE: 0.084 , RMSE: 0.11

In [18]:
component_plot = model.plot_components(forecast)
(Prophet component plots: overall trend, weekly seasonality, and yearly seasonality.)

These components are consistent with known patterns:

  • Rainfall has generally been increasing worldwide due to climate change
  • Bangalore Rural typically receives its heaviest rainfall during the Southwest Monsoon (June to September)

From the weekly component, Wednesday appears to be the day of the week when rain is most likely in Bangalore Rural.

In [19]:
# Plot the forecast 
fig = go.Figure()

# Add forecast data
fig.add_trace(go.Scatter(
    x=forecast['ds'], y=forecast['yhat'],
    mode='markers',
    name='Forecast',
    marker=dict(color='lightblue')
))

# Add test data to the plot
fig.add_trace(go.Scatter(
    x=test['ds'],
    y=test['y'],
    mode='markers',
    name='Test Data',
    marker=dict(color='gray')
))

fig.update_layout(
    xaxis_title="Date",
    yaxis_title="Rainfall (mm)"
)

fig.show()

Compare models

In [20]:
fig = go.Figure()

# Add Prophet's MAD line
fig.add_trace(go.Scatter(x=forecast['ds'], 
                         y=mad_prophet, 
                         mode='lines', 
                         name='Prophet',
                         line=dict(color='blue')))

# Add SARIMA's MAD line
fig.add_trace(go.Scatter(x=forecast['ds'], 
                         y=mad_sarima, 
                         mode='lines', 
                         name='SARIMA',
                         line=dict(color='red')))

# Add labels and title
fig.update_layout(
    title='Comparison of absolute deviations: Prophet vs SARIMA (per time point)',
    xaxis_title='Date',
    yaxis_title='Absolute Deviation',
    showlegend=True
)

fig.show()

Conclusion

It is possible to roughly forecast rainfall in Bangalore Rural district with both SARIMA and Prophet. Even after adjusting for the skewed data, the forecast of the simple SARIMA model is not as good as Prophet's. Note that the SARIMA model could be tuned to perform better; we deliberately left it untuned, since the goal here is to demonstrate both models out of the box.